69 research outputs found

    HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering


    Comparing human and automatic speech recognition in a perceptual restoration experiment

    Speech that has been distorted by introducing spectral or temporal gaps is still perceived as continuous and complete by human listeners, so long as the gaps are filled with additive noise of sufficient intensity. When such perceptual restoration occurs, the speech is also more intelligible than when no noise has been added in the gaps. This observation has motivated so-called 'missing data' systems for automatic speech recognition (ASR), but there have been few attempts to determine whether such systems are a good model of perceptual restoration in human listeners. Accordingly, the current paper evaluates missing data ASR in a perceptual restoration task. We evaluated two systems: one based on a new approach to bounded marginalisation in the cepstral domain, and one based on bounded conditional mean imputation. Both methods model the available speech information as a clean-speech posterior distribution that is subsequently passed to an ASR system. The proposed missing data ASR systems were evaluated using distorted speech in which spectro-temporal gaps were optionally filled with additive noise. Speech recognition performance of the proposed systems was compared against a baseline ASR system and against human speech recognition performance on the same task. We conclude that missing data methods improve speech recognition performance in a manner that is consistent with perceptual restoration in human listeners.
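
    As a concrete illustration of the second of the two methods, bounded conditional mean imputation, the sketch below imputes the unreliable bins of a single log-spectral frame under a single-Gaussian clean-speech prior. This is a minimal sketch under assumptions, not the paper's system (which produces a full clean-speech posterior for the recogniser rather than a point estimate); the prior parameters `mu` and `cov` and the reliability mask are hypothetical inputs.

```python
import numpy as np

def bounded_cond_mean_impute(y, reliable, mu, cov):
    """Impute the unreliable bins of one observed log-spectral frame.

    y        -- observed noisy log-spectrum (1-D array)
    reliable -- boolean mask, True for speech-dominated (reliable) bins
    mu, cov  -- mean and covariance of an assumed Gaussian clean-speech prior
    """
    r = np.flatnonzero(reliable)           # reliable bins
    u = np.flatnonzero(~reliable)          # unreliable (masked) bins
    x = y.astype(float).copy()
    # Conditional mean of the unreliable bins given the reliable observations.
    cov_rr = cov[np.ix_(r, r)]
    cov_ur = cov[np.ix_(u, r)]
    cond_mean = mu[u] + cov_ur @ np.linalg.solve(cov_rr, y[r] - mu[r])
    # Bounded step: with additive noise, the clean log-energy of a masked bin
    # cannot exceed the observed noisy value, so the observation is an upper bound.
    x[u] = np.minimum(cond_mean, y[u])
    return x
```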

    Comparing glottal-flow-excited statistical parametric speech synthesis methods


    Sensitivity of the human auditory cortex to acoustic degradation of speech and non-speech sounds

    The perception of speech is usually an effortless and reliable process even in highly adverse listening conditions. In addition to external sound sources, the intelligibility of speech can be reduced by degradation of the structure of the speech signal itself, for example by digital compression of sound. This kind of distortion may be even more detrimental to speech intelligibility than external distortion, given that the auditory system cannot utilize sound source-specific acoustic features, such as spatial location, to separate the distortion from the speech signal. The perceptual consequences of acoustic distortions on speech intelligibility have been studied extensively. However, the cortical mechanisms of speech perception in adverse listening conditions are not well known at present, particularly in situations where the speech signal itself is distorted. The aim of this thesis was to investigate the cortical mechanisms underlying speech perception in conditions where speech is less intelligible due to external distortion or as a result of digital compression. In the studies of this thesis, the intelligibility of speech was varied either by digital compression or by the addition of stochastic noise. Cortical activity related to the speech stimuli was measured using magnetoencephalography (MEG). The results indicated that degradation of speech sounds by digital compression enhanced the evoked responses originating from the auditory cortex, whereas the addition of stochastic noise did not modulate the cortical responses. Furthermore, it was shown that if the distortion was presented continuously in the background, the transient activity of the auditory cortex was delayed. On the perceptual level, digital compression reduced the comprehensibility of speech more than additive stochastic noise. In addition, it was demonstrated that prior knowledge of speech content substantially enhanced the intelligibility of distorted speech, and this perceptual change was associated with an increase in cortical activity within several regions adjacent to the auditory cortex. In conclusion, the results of this thesis show that the auditory cortex is very sensitive to the acoustic features of the distortion, while at later processing stages several cortical areas reflect the intelligibility of speech. These findings suggest that the auditory system rapidly adapts to the variability of the auditory environment, and can efficiently utilize previous knowledge of speech content in deciphering acoustically degraded speech signals.

    The perception of speech is usually effortless and reliable even in very poor listening conditions. However, in addition to environmental noise sources, the intelligibility of speech can also deteriorate when the structure of the speech signal itself is altered, for example by digital audio compression. Such distortion can degrade intelligibility even more severely than external interference, because the auditory system cannot exploit sound source-specific properties, such as the direction of arrival, to separate the distortion from the speech. The effects of acoustic distortions on speech perception have been studied extensively, but the brain mechanisms involved are still known rather incompletely, especially in situations where the speech signal itself is degraded. The aim of this dissertation was to investigate the brain mechanisms of speech perception in situations where the speech signal is harder to understand either because of an external sound source or because of digital compression. In the four studies of the dissertation, the intelligibility of short speech sounds and of continuous speech was manipulated either through digital compression or by adding stochastic noise to the speech signal. Brain activity related to the speech stimuli was studied with magnetoencephalography measurements. The studies showed that evoked responses generated in the auditory cortex were enhanced when speech sounds were compressed digitally, whereas stochastic noise added to the speech sounds did not affect the evoked responses. Furthermore, if a continuous distortion was presented behind the speech sounds, the activation of the auditory cortex was delayed as the intensity of the distortion increased. Listening experiments showed that digital compression reduces the intelligibility of speech sounds more strongly than stochastic noise. In addition, it was shown that prior knowledge of the speech content substantially improved the intelligibility of distorted speech, which was reflected in brain activity in regions adjacent to the auditory cortex such that intelligible speech elicited stronger activation than poorly intelligible speech. The results of the dissertation show that the auditory cortex is highly sensitive to acoustic distortions of speech sounds, and that at later processing stages several regions adjacent to the auditory cortex reflect the intelligibility of speech. Based on these results, it can be assumed that the auditory system adapts rapidly to variations in the auditory environment, among other things by exploiting prior knowledge of the speech content when interpreting a distorted speech signal.

    Atypical perceptual narrowing in prematurely born infants is associated with compromised language acquisition at 2 years of age

    Background: Early auditory experiences are a prerequisite for speech and language acquisition. In healthy children, phoneme discrimination abilities improve for native and degrade for unfamiliar, socially irrelevant phoneme contrasts between 6 and 12 months of age, as the brain tunes itself to, and specializes in, the native spoken language. This process is known as perceptual narrowing, and has been found to predict normal native language acquisition. Prematurely born infants are known to be at an elevated risk for later language problems, but it remains unclear whether these problems relate to early perceptual narrowing. To address this question, we investigated early neurophysiological phoneme discrimination abilities and later language skills in prematurely born infants and in healthy, full-term infants. Results: Our follow-up study shows for the first time that the perceptual narrowing for non-native phoneme contrasts found in the healthy controls at 12 months was not observed in very prematurely born infants. An electric mismatch response of the brain indicated that whereas full-term infants gradually lost their ability to discriminate non-native phonemes from 6 to 12 months of age, prematurely born infants retained this ability. Language performance tested at the age of 2 years showed a significant delay in the prematurely born group. Moreover, those infants who had not become specialized in native phonemes by the age of one year performed worse in the communicative language test (MacArthur Communicative Development Inventories) at the age of two years. Thus, a decline in sensitivity to non-native phonemes served as a predictor of further language development. Conclusion: Our data suggest that the detrimental effects of prematurity on language skills are based on a low degree of specialization to the native language early in development. Moreover, delayed or atypical perceptual narrowing was associated with slower language acquisition. The results hence suggest that language problems related to prematurity may partly originate already at this early tuning stage of language acquisition.

    Using group delay functions from all-pole models for speaker recognition

    This work was presented as a paper at the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), held in Lyon, France, on 25-29 August 2013. Popular features for speech processing, such as mel-frequency cepstral coefficients (MFCCs), are derived from the short-term magnitude spectrum, whereas the phase spectrum remains unused. While the common argument for using only the magnitude spectrum is that the human ear is phase-deaf, phase-based features have also remained less explored due to the additional signal processing difficulties they introduce. A useful representation of the phase is the group delay function, but its robust computation remains difficult. This paper advocates the use of group delay functions derived from parametric all-pole models instead of their direct computation from the discrete Fourier transform. Using a subset of the vocal effort data in the NIST 2010 speaker recognition evaluation (SRE) corpus, we show that group delay features derived via parametric all-pole models improve recognition accuracy, especially under high vocal effort. Additionally, the group delay features provide comparable or improved accuracy over conventional magnitude-based MFCC features. Thus, group delay functions derived from all-pole models provide an effective way to utilize information from the phase spectrum of speech signals. This work was supported by the Academy of Finland (253120).
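
    A rough sketch of the advocated parametric computation is given below: fit an all-pole (LPC) model to a single speech frame by the autocorrelation method and evaluate the group delay of the resulting all-pole filter 1/A(z) on a uniform frequency grid. This is an illustrative sketch, not the authors' implementation; the model order, analysis window and frequency grid are assumed, illustrative choices.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import group_delay

def allpole_group_delay(frame, order=20, nfft=512):
    """Group delay (in samples) of an all-pole model fitted to one speech frame."""
    x = frame * np.hamming(len(frame))                 # analysis window
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation r[0], r[1], ...
    a = solve_toeplitz(r[:order], r[1:order + 1])      # LPC normal equations
    a_poly = np.concatenate(([1.0], -a))               # prediction error filter A(z)
    # Group delay of the all-pole model 1/A(z); w is in radians/sample.
    w, gd = group_delay(([1.0], a_poly), w=nfft)
    return w, gd

# Example on a synthetic 25 ms frame at 16 kHz.
fs = 16000
w, gd = allpole_group_delay(np.random.randn(int(0.025 * fs)))
```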

    Early detection of continuous and partial audio events using CNN

    Sound event detection is an extension of the static auditory classification task into continuous environments, where performance depends jointly upon the detection of overlapping events and their correct classification. Several approaches have been published to date, which either develop novel classifiers or employ well-trained static classifiers with a detection front-end. This paper takes the latter approach, combining a proven CNN classifier acting on spectrogram image features with time-frequency shaped energy detection that identifies seed regions within the spectrogram that are characteristic of auditory energy events. Furthermore, the shape detector is optimised to allow early detection of events as they are developing. Since some sound events naturally have longer durations than others, waiting until completion of entire events before classification may not be practical in a deployed system. The early detection capability of the system is thus evaluated for the classification of partial events. Performance for continuous event detection is shown to be good, with accuracy well maintained when detecting partial events.
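
    As a schematic stand-in for the time-frequency shaped energy detection described above (not the paper's detector), the sketch below smooths a spectrogram, estimates a per-band noise floor, and labels connected regions that rise well above it as candidate seed regions; the thresholds and smoothing kernel are illustrative assumptions. Each labelled region could then be cropped, possibly before the underlying event has finished, and passed to a separately trained CNN classifier.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, label
from scipy.signal import spectrogram

def seed_regions(audio, fs, thresh_db=12.0):
    """Label time-frequency regions whose smoothed energy exceeds the noise floor."""
    f, t, sxx = spectrogram(audio, fs=fs, nperseg=512, noverlap=384)
    log_s = 10.0 * np.log10(sxx + 1e-10)                  # log-power spectrogram
    smoothed = gaussian_filter(log_s, sigma=(1.0, 2.0))   # shape energy in time-frequency
    floor = np.median(smoothed, axis=1, keepdims=True)    # per-band noise floor estimate
    mask = smoothed > floor + thresh_db                   # bins well above the floor
    labels, n_regions = label(mask)                       # connected candidate seed regions
    return labels, n_regions, f, t
```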

    PREMIUM, a benchmark on the quantification of the uncertainty of the physical models in the system thermal-hydraulic codes: methodologies and data review

    The objective of the Post-BEMUSE Reflood Model Input Uncertainty Methods (PREMIUM) benchmark is to make progress on the quantification of the uncertainty of the physical models in system thermal-hydraulic codes by considering a concrete case: the physical models involved in the prediction of core reflooding. The present document was initially conceived as a final report for Phase I, “Introduction and Methodology Review”, of the PREMIUM benchmark. The objective of Phase I is to refine the definition of the benchmark and publish the available methodologies of model input uncertainty quantification relevant to the objectives of the benchmark. In its initial version the document was approved by WGAMA and proved useful during the subsequent phases of the project. Once Phase IV was completed, and following the suggestion of WGAMA members, the document was updated with a few new sections, in particular the description of four new methodologies that were developed during this activity. These developments were carried out by some participants while contributing to PREMIUM progress (which is why this report arrives after those of the other phases). After this revision the document title was changed to “PREMIUM methodologies and data review”. The introduction first includes a chapter devoted to contextualizing the benchmark within nuclear safety research and licensing, followed by a description of the PREMIUM objectives. Next, the phases into which the benchmark is divided and its organization are described. Chapter two reviews the involvement of the different participants, with a brief explanation of the input uncertainty quantification methodologies used in the activity. The document ends with some conclusions on the development of Phase I, some more general remarks and some statements on the benefits of the benchmark, which can be briefly summarized as follows:
    - Contribution to the development of tools and experience related to uncertainty calculation, and promotion of the use of BEPU approaches for licensing and safety assessment purposes;
    - Contribution to the prioritization of improvements to thermal-hydraulic system codes;
    - Contribution to a fluent and close interaction between the scientific community and regulatory organizations.
    The appendices include the complete description of the FEBA/SEFLEX experimental data used in the benchmark, the CIRCÉ and FFTBM methodologies, and the general requirements and description specification used for Phase I. Following the revision of the document, four additional appendices have been added describing the methods developed during the activity: the MCDA, DIPE, Tractebel IUQ and PSI methods.

    Non-hexagonal neural dynamics in vowel space

    Are the grid cells discovered in rodents relevant to human cognition? Following up on two seminal studies by others, we aimed to check whether an approximate 6-fold, grid-like symmetry shows up in the cortical activity of humans who "navigate" between vowels, given that vowel space can be approximated by a continuous trapezoidal 2D manifold spanned by the first and second formant frequencies. We created 30 vowel trajectories in the assumedly flat central portion of the trapezoid. Each of these trajectories had a duration of 240 milliseconds, with a steady start and end point on the perimeter of a "wheel". We hypothesized that if the neural representation of this "box" is similar to that of rodent grid units, there should be an at least partial hexagonal (6-fold) symmetry in the EEG response of participants who navigate it. We did not find any dominant n-fold symmetry, however; instead, using PCA, we find indications that the vowel representation may reflect phonetic features, as positioned on the vowel manifold. The suggestion, therefore, is that vowels are encoded in relation to their salient sensory-perceptual variables, and are not assigned to arbitrary grid-like abstract maps. Finally, we explored the relationship between the first PCA eigenvector and putative vowel attractors for native Italian speakers, who served as the subjects in our study.
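
    The symmetry test referred to above is commonly run as a directional regression; the sketch below is a schematic version under assumptions, not the analysis pipeline of this study. Per-trajectory direction angles `theta` (in F1/F2 space, radians) and response amplitudes `resp` are hypothetical inputs, and a real analysis would add cross-validated grid-orientation estimation and proper statistics.

```python
import numpy as np

def nfold_symmetry_scores(theta, resp, folds=(4, 5, 6, 7, 8)):
    """R^2 of an n-fold directional modulation model for each candidate symmetry."""
    scores = {}
    for n in folds:
        # Regress the response on cos/sin of n times the trajectory direction.
        X = np.column_stack([np.ones_like(theta),
                             np.cos(n * theta), np.sin(n * theta)])
        beta, *_ = np.linalg.lstsq(X, resp, rcond=None)
        resid = resp - X @ beta
        scores[n] = 1.0 - resid.var() / resp.var()
    return scores  # a grid-like (hexadirectional) code would favour n = 6
```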